Abstract

Studies evaluating hypotheses about sources of differential item functioning (DIF) are
classified into two categories: observational studies evaluating operational items and randomized
DIF studies evaluating specially constructed items. For observational studies, advice is given
for item classification, sample selection, the matching criterion, and the choice of DIF
techniques, as well as how to summarize, synthesize and translate DIF data into DIF hypotheses.
In randomized DIF studies of specially constructed items, specific hypotheses, often generated
from observational studies, are evaluated under rigorous conditions. Advice for these studies
focuses on the importance of carefully constructed items to assess DIF hypotheses. In addition,
randomized DIF studies are cast within a causal inference framework, which provides a
justification for the use of standardization analyses or logistic regression analysis to estimate
effect sizes. Two studies that have components spanning the observational and controlled
domains are summarized for illustrative purposes. Standardization analyses are used for both
studies. Special logistic regression analyses of an item from one of these studies are provided
to illustrate a new approach in the assessment of DIF hypotheses using specially constructed
items.

EVALUATING HYPOTHESES ABOUT
DIFFERENTIAL ITEM FUNCTIONING
Alicia P. Schmitt, Paul W. Holland, and Neil J. Dorans
Educational Testing Service

1. INTRODUCTION 

Differential item functioning (DIF) research has had as one of its major goals the
identification of the causes of DIF. Typically, DIF research has focused on determining
characteristics of test items that are related differentially to subgroups of examinees and
which thus might explain or be a cause of DIF in an item. The premise has been that, after
identifying specific DIF-related factors, test development guidelines could be generated to
prevent their future occurrence. With the elimination of these DIF factors, the items would not
exhibit DIF and, thus, the total score would provide a better estimate of the true abilities of
examinees from any subpopulation. The reality is that, to date, only a limited number of
hypothesized DIF factors seem to hold consistently and that even these factors need to be better
understood so that test construction guidelines address them with the needed specificity.

There are several reasons why progress in the identification of DIF-related factors has
been slow. First, the study of DIF is relatively new and so the initial emphasis was on the
development of statistical methods to identify DIF. Dorans and Holland (in press) and Thissen,
Steinberg, and Wainer (in press) provide good descriptions of the state-of-the-art statistical
methods used to detect DIF.

Second, identifying these factors requires a theory of differential item difficulty in a field in
which theories of item difficulty are not well developed. Related to this is the fact that the
reference and focal groups used to date (Blacks, Hispanics, Asian-Americans, and women, for
example) are very heterogeneous and their differences are not easy to describe.

Third, the identification process is complex. Since more than one factor could be related
to DIF in a given item, zeroing in on the specific cause of DIF for one item is not a simple
process, and confirming studies designed to test hypotheses about the causes of DIF are rare.
We hope this paper helps to stimulate further empirical work in this area.

The purpose of this paper is to present and propose procedures for the systematic 
evaluation and corroboration of DIF-related factors or hypotheses. Descriptions of procedures 
to undertake observational DIF studies, to develop hypotheses, and to evaluate and construct 
items with the hypothesized factors are presented. Analytical comparison analyses are described 
and examples provided. 

The systematic evaluation of DIF hypotheses involves a two-step process. The first step
entails measuring DIF on regular operational items and using this information to generate
hypotheses. The second step is a confirmatory evaluation of those hypotheses generated in step
one. Thus, the main focus of the second step is the randomized DIF study in which specially
constructed items are developed to test specific hypotheses and administered under conditions
that permit appropriate statistical analyses to assess the efficacy of the hypotheses.



2. OBSERVATIONAL STUDIES: EVALUATING OPERATIONAL ITEMS

Hypotheses about factors related to DIF can be generated on the basis of theoretical or 
empirical considerations. Theoretical DIF hypotheses are founded on prior knowledge 
pertaining to cognitive processes that could be related to differential performance on test items.
Although theoretical generation of DIF hypotheses is conceptually the first and most reasonable
way to postulate logical reasons for DIF, it has not been very fruitful. Most test construction
practices are carefully developed to avoid obvious factors that are known or suspected to be
possible sources of discrimination toward any subpopulation of examinees. Processes such as
the Test Sensitivity Review Process used at Educational Testing Service are used to evaluate
developed items to ensure fairness to women and ethnic groups. This process is discussed by
Ramsey (in press). Evaluation criteria for such sensitivity review procedures are designed so
that items included in a test "...measure factors unrelated to such groups (minorities and
women)" (Hunter & Slaughter, 1980, p.8). Therefore, logical or theoretical causes of DIF due
to discrimination against women or ethnic minorities are supposed to be excluded from test
instruments and thus cannot be evaluated.

Empirical DIF hypotheses, generated after analyses of DIF data, may suggest that certain
characteristics of items are differentially related to one or more subgroups of the population.
Observational studies refer to investigations that make use of data and items constructed and
administered under operational conditions. DIF analyses are conducted for all items in these
tests to evaluate whether any item exhibits differential functioning for women or minority
examinees. Performance of women on each item is compared to the performance of matched
men (reference group for the female focal group) while the item performance of each minority
group (e.g., Asian-American, Black, Hispanic, and Native-American examinees) is contrasted
to that of comparable White examinees (reference group for each minority focal group). Results
of these DIF analyses can provide empirical information to generate DIF hypotheses.

2.1 DEVELOPMENT OF HYPOTHESES: EXTREME DIF ITEMS. 

Evaluation of items with extreme DIF can provide insight into factors that might be
related to DIF. Such a process involves a careful examination of the items with extreme DIF
by a variety of experts. The speculation about or insight into possible causes of DIF for these
items from test developers, researchers, focal group members, cognitive psychologists, and
subject specialists can be used to generate hypotheses. Differential distractor information can
engender additional insight into causes of DIF. Knowledge about which distractors differentially
attract a specific subgroup may help to understand the respondents' cognitive processes.
Differential distractor analyses are described by Dorans and Holland (in press) and Thissen,
Steinberg, and Wainer (in press). Usually, analyses of more than one test form are
required in order to observe commonalities across items identified as having extreme differential
performance. Some of the generated hypotheses might only consist of a general speculation or
"story" about sources of DIF. It is important to consider any possible explanation. Since this
stage is a generation-of-ideas phase, it can be considered almost a "brainstorming" process.
Those possible explanations deemed most reasonable can then be developed into hypotheses to
be tested.

2.2 EVALUATION OF HYPOTHESES THROUGH OBSERVATIONAL DATA

Once a number of possible hypotheses have been identified, the next step is to evaluate
the efficacy of these hypotheses. Procedural steps to evaluate DIF hypotheses using
observational data are delineated below.

Classification of Items 

In order to evaluate the hypotheses, all items of a test form under study need to be 
classified with respect to the various hypothesized item factors or characteristics. A clear and 
precise definition of the factors to be studied needs to be provided. At least two experts or 
judges should classify each item according to each hypothesized factor. In cases where the two 
judges disagree, a third expert should be consulted. In this fashion, each item is identified as
containing or not containing the factor or item characteristic under evaluation. Typically, a
dichotomous classification is coded for each item factor. In those cases when a factor might
consist of gradients or levels, a more continuous classification is appropriate. In addition,
information about related variables might also be identified and coded. For example, the
location of the item factor of interest (i.e., in the stem, key, or distractors) or the item type
(e.g., antonyms, analogies, sentence completion, reading comprehension for verbal items) might
provide information relevant to the relation of the factor to DIF. In fact, current research has
shown that the greatest relationship between true cognates and DIF for Hispanic examinees is
found when all components of an item have true cognates and the next greatest effect is found
for those items with true cognates in the stem and/or key. On the SAT, these relationships were
found to be most notable for antonym and analogy item types (Schmitt, 1988; Schmitt, Curley,
Bleistein, & Dorans, 1988; Schmitt & Dorans, 1990b).
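
The dichotomous coding rule described above can be sketched in a few lines. This is an illustration only: the function name `classify_item`, the Boolean coding, and the convention that the third expert's judgment settles a disagreement are assumptions made for the example, since the text states only that a third expert should be consulted.

```python
from typing import Optional


def classify_item(judge_1: bool, judge_2: bool,
                  judge_3: Optional[bool] = None) -> bool:
    """Return the dichotomous factor code for one item (hypothetical helper)."""
    if judge_1 == judge_2:
        return judge_1  # the two judges agree, so their shared code is used
    if judge_3 is None:
        raise ValueError("judges disagree; a third expert must be consulted")
    return judge_3  # assumption: the third expert's judgment breaks the tie
```

A factor with gradients or levels would need an ordinal or continuous code instead of a Boolean one.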

Sampling Procedures 

Groups 

DIF factors can be postulated to be related to the differential item performance between
two groups of examinees. In some instances, a postulated factor might not be specific to any
one group. In such cases, more than one focal group might be of interest in a particular study.
Typically, focal and reference groups have been determined on the basis of their gender and/or
race or ethnic origin (i.e., females as focal group with males as reference group and Asian
Americans, Blacks, Hispanics, or Native Americans as focal groups with Whites as reference
group). Nevertheless, other characteristics (e.g., income level, educational background, or
language knowledge) can serve to either further delimit ethnic or gender groups or to define
other distinctive groups of interest. How focal and reference groups are determined and
delimited depends both on the population characteristics of the examinees for whom the test is
designed and intended as well as on hypothesized group characteristics. Caution about
the number of characteristics chosen to determine the groups under study is recommended.
As the number of group-delimiting variables increases, the sample size of these groups is
consequently restricted. Moreover, when several variables determine a group, findings about
factors related to DIF are harder to interpret and their effect harder to ascribe to specific group
variables.



All possible examinees in each focal and reference group should be used when doing DIF
research. Because the comparison of comparable groups of examinees is an important
component in the calculation of DIF statistics, differences in item performance of focal and
reference groups are calculated at each ability level. Ability levels based on a predetermined
criterion (e.g., total test score or another related ability measure) are used in the computation
of DIF indices in a fashion analogous to how a blocking variable is used in a randomized block
or in a split-plot design. For this reason, a reasonable number of examinees at each ability level
is essential. The largest possible number of examinees in both the reference and focal groups
should be used to render stable DIF estimates and to ensure sufficient power to detect DIF
effects. The standard error of the DIF statistic should be examined to help interpret results
when samples are small. Dorans and Holland (in press) and Donoghue, Holland, and Thayer
(in press) discuss the standard error formulas and their accuracy.
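
To make the level-by-level comparison concrete, the sketch below computes a standardized p-difference in its commonly written form: the focal-minus-reference difference in proportion correct at each matching-score level, averaged with focal-group counts as weights. The function name `std_p_dif`, the data layout, and the counts are hypothetical and chosen only for illustration; the standard error formulas discussed in the references above are not reproduced here.

```python
from typing import Dict, Tuple

# counts[m] = (focal_right, focal_total, ref_right, ref_total) at score level m
Counts = Dict[int, Tuple[int, int, int, int]]


def std_p_dif(counts: Counts) -> float:
    """Focal-weighted average of (P_focal - P_reference) over score levels."""
    weighted_sum = 0.0
    total_weight = 0.0
    for f_right, f_n, r_right, r_n in counts.values():
        if f_n == 0 or r_n == 0:
            continue  # a level lacking either group cannot be compared
        weighted_sum += f_n * (f_right / f_n - r_right / r_n)
        total_weight += f_n
    if total_weight == 0:
        raise ValueError("no score level contains both groups")
    return weighted_sum / total_weight


# Invented counts: the focal group is 0.10 lower at both score levels.
example = {1: (10, 50, 30, 100), 2: (30, 50, 70, 100)}
index = std_p_dif(example)
```

Levels with few examinees contribute unstable proportions, which is one reason the text asks for a reasonable number of examinees at each ability level.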

DIF Analyses Procedures 

Statistical Procedures 

What statistical measure of DIF to use when conducting observational DIF studies is no
longer the controversial decision it once was. The notable development and comparison of
several DIF statistical methods during this decade have produced methods that not only are
reliable but also generally have good agreement (Dorans, 1989; Dorans & Holland, in press;
Dorans & Kulick, 1986; Holland, 1985; Holland & Thayer, 1988; Donoghue, Holland &
Thayer, in press; Thissen, et al., in press; Scheuneman & Bleistein, 1989). Moreover, use of
more than one statistical method may be recommended. Currently, the operational assessment
of DIF at Educational Testing Service uses the Mantel-Haenszel procedure to flag items for DIF
and the closely related standardization procedure as a statistical tool to generate and assess
content-based explanations for DIF. As mentioned previously, in addition to statistical indicants
of DIF for the correct response, the development and evaluation of DIF hypotheses benefit from
differential information on distractor selection, omitted responses, and speededness. Similarly,
evaluation of empirical-option test regression curves and conditional differential response-rate
plots for all these responses can indicate if any DIF effect is dependent on ability. Refer to
Dorans and Holland (in press), and to Dorans, Schmitt, and Bleistein (1988) for descriptions of
how to apply the standardization method to the computation of differential distractor, omit, and
speededness functioning. Use of a log-linear model to examine DIF through the analysis of
distractor choices by examinees who answered an item incorrectly is described by Green, Crone,
and Folk (1989). Also see Thissen, et al. (in press), for a discussion of differential alternative
functioning (DAF).
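
For readers unfamiliar with the Mantel-Haenszel procedure, the following sketch computes the Mantel-Haenszel common odds ratio across matching-score levels and its usual re-expression on the delta scale, MH D-DIF = -2.35 ln(alpha). It is a minimal illustration under assumed inputs: the function names and the counts are hypothetical, and the operational flagging rules and significance test are not shown.

```python
import math
from typing import Iterable, Tuple

# One stratum per matching-score level:
# (reference_right, reference_wrong, focal_right, focal_wrong)
Stratum = Tuple[int, int, int, int]


def mh_odds_ratio(strata: Iterable[Stratum]) -> float:
    """Mantel-Haenszel estimate of the common odds ratio across strata."""
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        total = ref_right + ref_wrong + foc_right + foc_wrong
        if total == 0:
            continue  # an empty score level carries no information
        numerator += ref_right * foc_wrong / total
        denominator += ref_wrong * foc_right / total
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_d_dif(strata: Iterable[Stratum]) -> float:
    """Delta-scale index; negative values indicate the item disfavors the focal group."""
    return -2.35 * math.log(mh_odds_ratio(list(strata)))


# Invented counts: equal odds in the first case, reference group favored in the second.
no_dif = [(40, 60, 20, 30)]
against_focal = [(60, 40, 20, 30)]
```

An odds ratio of 1 corresponds to an index of 0, that is, no DIF at matched ability levels.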

Matching Criterion 

The comparability of the focal and reference groups is achieved by matching these groups 
on the basis of a measure of test performance. Typically this measure is the total score on the 
test to be evaluated for DIF and is sometimes referred to as an internal matching criterion. 

The major consideration in the selection of an appropriate DIF matching criterion is the
degree of relationship between the construct of interest and the criterion. For DIF analyses, the
construct of interest is what the test item is constructed to measure. If the total score matching
criterion is multidimensional, it will be measuring more than the construct of interest and may
not be highly related to the item. Use of such a multidimensional total score criterion could
compromise the comparability of the groups for a specific test item.

Another possible source of error in the estimation of a comparable total DIF matching
criterion for the focal and reference groups is differential speededness (Dorans, Schmitt &
Bleistein, 1988). Several studies are currently evaluating the effect of differential speededness
and are considering this proposed speededness refinement (Schmitt, Dorans, Crone &
Maneckshana, 1991).

Differential Response Style Factors 

Different examinees approach the test taking experience differently and these different
response style factors may have an impact on DIF assessment, particularly for items at the end
of test sections. These response style factors are differential speededness and differential
omission (Schmitt & Dorans, 1990). When an examinee does not respond to an item and does
not respond to any subsequent items in a timed test section, all those items are referred to as
"not reached". Differential speededness refers to the existence of differential response rates
between comparable focal and reference group examinees to items appearing at the end of a test
section. When an examinee does not respond to an item, but responds to subsequent items, that
lack of response is referred to as an "omit". Differential omission refers to the occurrence of
differential omit rates between comparable focal and reference group examinees. Adjustments
for these differential response styles are important when evaluating DIF hypotheses because their
occurrence can confound results.
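
The distinction between an omit and a not-reached item can be expressed as a simple rule over one examinee's responses to a timed section. The sketch below is illustrative; the function name and the use of `None` for a missing response are assumptions.

```python
from typing import List, Optional


def label_responses(responses: List[Optional[str]]) -> List[str]:
    """Label each item 'answered', 'omit', or 'not reached' for one examinee."""
    last_answered = -1  # position of the final item that received a response
    for position, response in enumerate(responses):
        if response is not None:
            last_answered = position
    labels = []
    for position, response in enumerate(responses):
        if response is not None:
            labels.append("answered")
        elif position < last_answered:
            labels.append("omit")  # skipped, but a later item was answered
        else:
            labels.append("not reached")  # no response here or afterwards
    return labels


labels = label_responses(["A", None, "C", None, None])
```

Here the second item is an omit, while the last two items are not reached.
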

Descriptive Statistics

After all items have been classified and DIF indices estimated, DIF summary statistics
are computed for each level of the factor. The unit of analysis in this step is the item. Means,
medians, minimum and maximum values, and standard deviations can help evaluate the impact
of the postulated factor. Examination of all items classified as containing the factor under study
allows for the identification of items that differ from the positive or negative pattern expected
for the factor. Closer examination of such items can provide valuable information about possible
exceptions to the expected effect. Although correlation analysis can render useful associative
information, use of this descriptive statistic is limited. The dichotomous nature of most of the
hypothesized factors and the limited number of naturally occurring items with such factors
restrict the usefulness of statistical significance tests. Furthermore, the lack of controls
particular to naturalistic studies also hampers the evaluation of DIF hypotheses. Nevertheless,
naturalistic studies are a good first step, providing information valuable for the postulation of
DIF hypotheses and data for their evaluation and refinement.
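
A minimal sketch of these item-level summaries follows, using only the Python standard library. The function name, the data layout, and the DIF values are invented for illustration and are not results from any study.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def summarize_by_factor(items: Iterable[Tuple[bool, float]]) -> Dict[bool, Dict[str, float]]:
    """Summarize item DIF indices separately for each level of a coded factor."""
    by_level: Dict[bool, List[float]] = defaultdict(list)
    for has_factor, dif_index in items:
        by_level[has_factor].append(dif_index)
    summary = {}
    for level, values in by_level.items():
        summary[level] = {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
            # the sample standard deviation needs at least two items
            "sd": statistics.stdev(values) if len(values) > 1 else float("nan"),
        }
    return summary


# Invented (factor present?, DIF index) pairs, one per item.
coded_items = [(True, 0.04), (True, 0.02), (True, -0.01), (False, 0.00), (False, -0.02)]
summary = summarize_by_factor(coded_items)
```

In this invented data the third item, with a negative index among items containing the factor, is the kind of exception that the text suggests examining more closely.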

Confirmatory studies are a natural next step in the evaluation of DIF hypotheses. These
studies require the construction of items with the postulated characteristics and use scientific
methods to ensure that extraneous factors are controlled so that the factors of interest can be
accurately evaluated.

3. EXAMPLES OF OBSERVATIONAL DIF STUDIES

Two observational studies which have evaluated DIF hypotheses previously postulated
on the basis of DIF information will be described and findings reported.

3.1 LANGUAGE AND CULTURAL CHARACTERISTICS



An in-depth analysis of the extreme DIF items for Hispanic examinees on one form of
the Verbal Scholastic Aptitude Test (SAT-V) helped identify characteristics of the items that
might explain the differential functioning by two Hispanic subgroups. Four hypotheses were
generated about the differential item functioning of Hispanic examinees on verbal aptitude test
items. These hypotheses were:

1. True cognates, or words with a common root in English and Spanish, will 
tend to favor Hispanic examinees. Example: music (musica). 

2. False cognates, or words whose meaning is not the same in both 
languages, will tend to impede the performance of Hispanic examinees. 
Example: enviable, which means "sendable" in Spanish.

3. Homographs, or words that are spelled alike but which have different
meanings, will tend to impede the performance of Hispanic examinees.
Example: bark.

4. Items with content of special interest to Hispanics will tend to favor their
performance. Most special-interest items will tend to be reading
comprehension item types.
Example: reading passage about Mexican-American women.



For a detailed description of the study see Schmitt (1985, 1988).

Although some of the hypotheses were supported by the data (true cognates and special
interest specific to the Hispanic subgroup), the low frequencies of false cognates and
homographs in the two test editions studied precluded their evaluation. Construction of forms
where the occurrence of the postulated factors could be controlled and, thus, evaluated was
proposed as a follow-up to this investigation. The Schmitt et al. (1988) follow-up study
remedied the limited naturally occurring item factors by developing items with these factors and
administering them in non-operational SAT-V sections. Procedures and results of this
confirmatory investigation are described and reported in section 5.1.

3.2 DIFFERENTIAL SPEEDEDNESS 

In an effort to identify factors that might contribute to DIF, Schmitt and Bleistein (1987)
conducted an investigation of DIF for Blacks on SAT analogy items. Possible factors were
drawn from the literature on analogical reasoning and previous DIF research on Black examinees
(Dorans, 1982; Echternacht, 1972; Kulick, 1984; Rogers & Kulick, 1987; Rogers, Dorans &
Schmitt, 1986; Scheuneman, 1978, 1981; Stricker, 1982). Schmitt and
Bleistein performed their research in two steps. Hypotheses about analogical DIF were
developed after close examination of the three 85-item SAT-Verbal test forms studied by Rogers
and Kulick (1987). Following these analyses, two additional test forms were studied to validate
the hypothesized factors. Standardization analyses were conducted in which the key, all
distractors, not reached, and omits all served as dependent variables.

The major finding was that Black students do not complete SAT-Verbal sections at the
same rate as White students with comparable SAT-Verbal scores. This differential speededness
effect appeared to account for much of the negative DIF for Blacks on SAT-Verbal analogy
items. When examinees who did not reach the item were excluded from the calculation of the
standardized response rate differences, only a few analogy items exhibited DIF.

Dorans, Schmitt and Bleistein (1988) set out to document the differential speededness
phenomenon for Blacks, Asian-Americans and Hispanics on several SAT test editions, including
some studied by Schmitt and Bleistein (1987). They found that differential speededness was
most noticeable for Blacks and virtually nonexistent for Asian-Americans when compared with
matched groups of Whites. A randomized DIF study to evaluate differential speededness under
controlled conditions is described in section 5.3.

4. EVALUATING SPECIALLY CONSTRUCTED ITEMS
...we have not yet proved that antecedent to be the cause until we have reversed the
process and produced the effect by means of that antecedent artificially, and if, when we
do so, the effect follows, the induction is complete.... (Mill, 1843, p. 252)

In contrast to observational studies that evaluate operational data and can only draw
associational inferences about DIF and item characteristics, well-designed randomized studies
using specially constructed test items can be used to draw causal inferences about DIF and
postulated item factors. It is not until we can confirm an associated relation between item
factors and DIF by verifying that the expected DIF is found on specially constructed items with
these characteristics that we can ascribe these factors to be a cause of DIF. The basic features
of these randomized DIF studies are the use of control or comparison items and randomized
exposure of examinees to these items. The purpose of this section is to describe procedures for
constructing these items, for designing these systematic investigations, and for analyzing their
results. Examples of two studies where these techniques have been applied are presented.

4.1 METHOD AND DESIGN

Variables 

The variables used in randomized DIF studies may be described by the following
terminology adapted from experimental design: response variables, treatment variables, and
covariates. The response, or dependent, variable is the measure of the behavior predicted by a
DIF hypothesis (e.g., choosing a predicted response). The treatment variable indicates the extent
to which the item has the postulated DIF characteristics. Covariates are subject characteristics
that are not affected by exposure to a particular treatment, e.g., measures of performance on
related types of items or measures of education level and English language proficiency.

Instrument Development 

Some of the treatments in a randomized DIF study consist of exposure to test items that
have been specially constructed to disadvantage one or more groups of examinees. For this
reason, great care must be exercised in the construction of these testing instruments to conform
to test sensitivity and human subject guidelines as well as to test specifications and sound test
development practice. These constraints on the final testing instruments ensure that operational
scores are not affected by the study.

At least two levels of the treatment are needed in order to compare the effect of the DIF
factor of interest. Thus, two versions of each item are developed. Ideally, these two items are
identical in every respect but for the factor to be tested. The item that includes the postulated
DIF characteristic is the "treated" level (t) of the treatment variable. The other item (the version
that excludes the DIF factor) is the control level (c) of the treatment variable. The goal of the
construction of these pairs of items is to make them as similar as possible except for the factor
that is being tested. Parallelism of the two item versions is an important requirement that may
allow us to infer that differences in the differential performance on these two items are caused by
the difference between the items, i.e., the postulated factor. Achieving parallelism of the item
pairs is often difficult in practice because test questions are complex stimulus material and
a change in one aspect of an item often entails other changes as well.

In order to ensure parallelism when constructing parallel items, it is important to control
the following item characteristics: difficulty, discrimination, location in the test, item type,
content, and the location of the key and distractors. The number of items used to test each
factor is also an important consideration because the unit of analysis is the item itself. Thus,
it is desirable to construct several item pairs testing each factor.

The items in a pair are constructed to test a specific DIF hypothesis and are designed so
that the treatment item (t) is more likely to elicit a particular type of response than is the control
item (c) in the pair, especially for members of the focal group of interest. Schmitt et al. (1988)
give the following example of a pair of specially constructed antonym items used in their
randomized DIF study discussed further in section 5.1.



Item t 
PALLID: 

(A) moist 

(B) massive 

(C) * vividly colored 

(D) sweet smelling 

(E) young and innocent 



Item c 
ASHEN: 

(A) moist 

(B) massive 

(C) * vividly colored

(D) sweet smelling 

(E) young and innocent 



These two items are identical except for their stem, i.e., PALLID or ASHEN, which are
synonyms of roughly equal frequency in English. The factor being varied in this item pair is
the existence of a Spanish cognate for the stem word. In this case, palido is a common word
in Spanish while the cognate, pallid, is a less common word in English. The DIF hypothesis
for this item pair is that the existence of a common Spanish cognate for a relatively more rare
English word that plays an essential role in the item (in this case the stem word) will help
Spanish speaking examinees select the correct answer and will not help non-Spanish speaking
examinees.



Samples 

The relevant reference and focal groups are determined by the DIF factors that are
postulated. In a randomized DIF study, the reference and focal groups are then subdivided at
random into subgroups of examinees who are exposed to either the t or c version of each item.
Because of this subdivision, it is important to have large samples of each of the target groups.
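
The random subdivision can be sketched as follows. The examinee identifiers, group sizes, and the even split are hypothetical, and the seed is fixed only so that the illustration is reproducible.

```python
import random
from typing import Dict, List, Sequence


def split_at_random(examinees: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Randomly divide one group (reference or focal) between the t and c versions."""
    shuffled = list(examinees)
    random.Random(seed).shuffle(shuffled)  # local generator keeps the split reproducible
    half = len(shuffled) // 2
    return {"t": shuffled[:half], "c": shuffled[half:]}


groups = {
    "reference": [f"R{i}" for i in range(10)],  # invented identifiers
    "focal": [f"F{i}" for i in range(6)],
}
# The split is made separately within each group, so both versions of an item
# are answered by reference and focal examinees.
assignment = {name: split_at_random(members, seed=1) for name, members in groups.items()}
```

Each group is roughly halved by the split, which is why large samples of each target group are needed.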

Controls 

Randomized DIF studies are by definition controlled studies. The use of control or
comparison items allows us to infer that differences in DIF between the items in a pair are caused
by differences between the items (the postulated factor) as long as other causative variables are
not contaminating the results. There are three major types of extraneous variables that can
contaminate results if not controlled: examinee related differences, lack of parallelism of the
item pairs, and differences in the testing conditions of examinees taking each of the item pairs.
The control of these extraneous variables needs to be carefully considered. The types of controls
that can be used for this purpose are: randomization, constancy, balance, and counterbalance.

Randomization 

If the examinees who are exposed to the t version of an item pair differ in important
ways from those exposed to the c version, confounding is said to occur. Confounding makes
it difficult to separate the effect of responding to the t versus the c version of the item pair from
the characteristics of the examinees in these two groups. Randomization tends to equalize the
distribution of examinee characteristics in the t and c groups. It may be achieved by spiralling
subforms together, each containing only one member of each pair of special items. We discuss
the effect of randomization more formally in section 5.2.

Constancy

Some factors, such as the position of the key in a multiple choice question, can affect
examinee responses and should be the same for items t and c in a given pair. This is an
example of constancy. Other factors that might be controlled using this technique are item
position and response options. The PALLID/ASHEN item pair is an example with a great deal
of constancy across the pair of items. A special case of constancy arises when a factor is
eliminated in the sense that it is prevented by design from occurring. An example of a factor
that can be eliminated in a randomized DIF study is differential speededness. It is removed by
placing the specially constructed items at the beginning of the test section.

Balance 

Balance is used in two distinct ways. On the one hand, it can refer to equalizing the
distribution of important examinee characteristics across the t and c versions of an item pair.
Randomization will approximately balance the distribution of covariates in a large study, but in
a small study the researcher may need to achieve balance in a more active way (e.g., blocking).
On the other hand, balance can refer to the entire set of stimulus material that an examinee is
exposed to. Subforms are usually balanced with respect to content and item type so that they
will not appear unusual to the examinees and thereby will not elicit unrepresentative responses.

Counterbalance

Counterbalance refers to the stimulus material presented to the examinees. A factor that
is appropriate to counterbalance across the subforms used in the study is the total number of
occurrences of t and c items in each subform. When this is counterbalanced, all the subforms
will have the same number of t and c items even though they cannot contain the same t and c
items. This will tend to reduce the overall effect of each subform on differences in subgroup
performance.

Other Considerations 

The form of control used has an effect on the generality of the inferences made from the
study. For example, if only one level of item difficulty is used in the evaluation of a
hypothesis (i.e., constancy) then any resulting effect of the hypothesized DIF factor under study
may be restricted to items with the tested level of difficulty. It is, therefore, important to select
the method of control (i.e., balancing, etc.) based on the level of generality that is desired.

Another way to deal with extraneous variables is to control them in the design of the
study. In such cases the DIF factor will be one independent variable while another variable,
such as item difficulty, will be a second independent variable. In this example, we develop item
pairs that have similar difficulty within a pair, but varying difficulty across pairs. When more
than one independent variable is being studied at one time, evaluation of their interactive effect
is a part of the study. If an interaction is found then analyses should proceed to see how the
effect of the DIF factor varies across the levels of the interacting variables. Because the
outcome measures in DIF studies are usually response probabilities, the scale in which these are
measured, P or logit, may affect the study of interactions.

A major constraint on using several independent variables in designing the item pairs is
that the number of items has to be increased accordingly and the study made more complex.
Several items need to be constructed to evaluate each possible combined effect. For example,
if the DIF factor under study consists of two levels (DIF factor present or absent) and the other
independent variable consists of three levels (e.g., item difficulty: hard, medium, and easy) then
the total number of subgroups of items testing all possible combinations is six. If there are then
at least two examples of each item there are at least 12 items to study a single DIF factor.
Because of practical testing constraints, it may be necessary to limit the number of independent
variables to be studied at a time in a randomized DIF study.
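
The item count in the example above can be reproduced by enumerating the design cells. The
following is a minimal sketch; the helper names `design_cells` and `item_slots` are hypothetical
and are introduced only for illustration.

```python
from itertools import product

def design_cells(dif_levels, other_levels):
    """All combinations of DIF-factor level and second-factor level."""
    return list(product(dif_levels, other_levels))

def item_slots(cells, examples_per_cell):
    """One slot per item to be written: (DIF level, other level, replicate number)."""
    return [(d, o, k) for (d, o) in cells for k in range(1, examples_per_cell + 1)]

# Two DIF-factor levels crossed with three difficulty levels, two items per cell.
cells = design_cells(("present", "absent"), ("hard", "medium", "easy"))
slots = item_slots(cells, examples_per_cell=2)
print(len(cells), len(slots))  # 6 12
```

Adding a further independent variable multiplies the number of cells by its number of levels,
which is why the item requirement grows so quickly.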

4.2 A CAUSAL INFERENCE PERSPECTIVE ON RANDOMIZED DIF STUDIES 

This section adapts the formal model of Rubin (1974) and Holland (1986) to the analysis 
of randomized DIF studies. 

Dependent Measures 

In a randomized DIF study the basic dependent measure is the response an examinee
gives to the specially constructed test items. Assuming that we are considering multiple choice
tests, the responses of examinees are limited to choosing one of the response options or omitting
the item. It is also possible that an examinee might not attempt to respond to some items but
our analysis will condition on responding to the special items. Depending on exactly what DIF
hypothesis is being tested, the particular response of interest will vary. For many hypotheses
the behavior of interest is choosing the correct answer. For others it might be choosing or not
choosing a particular distractor and for others it will be the decision to omit the item. We will
not make an assumption on this and will let the outcome variable, Y, be dichotomous with

$$Y = \begin{cases} 1 & \text{if the examinee makes the predicted response relevant to the DIF hypothesis,} \\ 0 & \text{otherwise.} \end{cases}$$

(In all of our examples, however, we use Y = 1 to denote choosing the correct answer to the
special items.)

There are two potential responses that could be observed for an examinee, $Y_t$ or $Y_c$,
where

$Y_t$ = the value of Y that will be observed
if the examinee is asked to respond
to item t of the pair,

$Y_c$ = the value of Y that will be observed
if the examinee is asked to respond
to item c of the pair.

The difference, $Y_t - Y_c$, is the "causal effect" for a given examinee of being asked to respond to
item t rather than to item c in the pair. Let S denote the member of the pair of special items
to which the examinee is asked to respond, i.e., S = t or S = c. Then $Y_S$ is the actual response
that the examinee gives in the study. The notation $Y_S$ means the following:

$$Y_S = \begin{cases} Y_t & \text{if } S = t, \\ Y_c & \text{if } S = c. \end{cases}$$

The problem of causal inference in a randomized DIF study is to say as much as we can
about the unobservable causal effect, $Y_t - Y_c$, for each examinee from the observable data. For
example, if $Y_t - Y_c = 0$ then the examinee would make the same response regardless of the
version of the item to which he or she is exposed. When $Y_t - Y_c = 1$ then the examinee would
make the predicted response to t but not to c, etc.
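
To see why the examinee-level causal effect cannot be recovered from a single observed
response, one can list which response patterns are consistent with each observation. The sketch
below is illustrative only; the function names are hypothetical and not part of the original
model.

```python
from itertools import product

def observed_response(y_t, y_c, s):
    """Y_S: the response actually recorded when the examinee is given item s."""
    return y_t if s == "t" else y_c

def consistent_effects(s, y_s):
    """Values of Y_t - Y_c that agree with observing S = s and Y_S = y_s."""
    return sorted({
        y_t - y_c
        for y_t, y_c in product((0, 1), repeat=2)
        if observed_response(y_t, y_c, s) == y_s
    })

for s, y_s in product(("t", "c"), (0, 1)):
    print(s, y_s, consistent_effects(s, y_s))
```

Every observation leaves two possible values of the individual effect open, which is the reason
the analysis works with averages over groups of examinees instead.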

The Data 

So far we have mentioned two pieces of information that are available from each
examinee with respect to a given pair of items, the observed response $Y_S$ and the member of the
pair of special items to which the examinee responded, S. In addition there is other important
information. First of all, the examinee may belong to the focal or the reference group of
interest, or possibly to neither one. Denote group membership by G = r or f (reference or
focal). In addition there may be other test scores available for the examinee. For example, the
special items may be part of a larger test. Let X denote an additional score obtained from part
or all of this larger test. We must distinguish two important cases. If it is possible to assume
that the score X is unaffected by whether the examinee was asked to respond to item t or
to item c of the item pair of interest then X is called a covariate score. For example, if the
items that are part of the X-score are all asked prior to the examinees being asked to respond
to the special items then it is usually plausible to assume that X is a covariate score. On the
other hand, if the special item is included in the X-score, then X is not a covariate score. We
will use covariate scores to group examinees.

In summary, the data observed for a given examinee can be expressed as
$Y_S$, S, G, and X.



The Average Causal Effect 

The individual level causal effect, $Y_t - Y_c$, is not directly observable for a single examinee
because we only observe $Y_t$ or $Y_c$ (but not both) on each examinee. An average causal effect
(ACE) is found by averaging the individual level causal effects over various groups of
examinees. For example we might consider

$$E(Y_t - Y_c), \tag{1}$$

the average over everyone in the study, or

$$E(Y_t - Y_c \mid G = f), \tag{2}$$

the average over everyone in the focal group, or

$$E(Y_t - Y_c \mid G = r), \tag{3}$$

the average over everyone in the reference group. Finally we might consider

$$E(Y_t - Y_c \mid G = g, X = x), \tag{4}$$

the average over everyone in group g with covariate score x. We will show later that average
causal effects can be estimated by the data obtained in a randomized DIF study, even though the
examinee-level causal effects cannot be estimated.

Let us consider the ACEs in (1) - (4) further. Because $Y_t$ and $Y_c$ are dichotomous, the
expectations in (1) - (4) may be expressed in terms of probabilities, i.e.,

$$E(Y_t - Y_c) = P(Y_t = 1) - P(Y_c = 1), \tag{5}$$

$$E(Y_t - Y_c \mid G = g) = P(Y_t = 1 \mid G = g) - P(Y_c = 1 \mid G = g), \tag{6}$$

$$E(Y_t - Y_c \mid G = g, X = x) = P(Y_t = 1 \mid G = g, X = x) - P(Y_c = 1 \mid G = g, X = x). \tag{7}$$

The ACE, $E(Y_t - Y_c)$, averages over all examinees and as such represents the "main
effect" of item t relative to item c over all examinees. While this main effect is important, it
is not the primary parameter of interest in a randomized DIF study. When Y represents
choosing the correct option in a multiple choice test, the main effect (5) is simply the difference
in the percent correct for items t and c over the examinees in the study. As we shall see, it is
desirable to construct t and c so that the main effect (5) is small.

In general, the idea behind a randomized DIF study is that item t will elicit a bigger
change in the probability of the predicted response relative to item c for members of the focal
group than it does for members of the reference group. This leads us to examine the
ACE-difference or interaction parameter defined by

$$T = E(Y_t - Y_c \mid G = f) - E(Y_t - Y_c \mid G = r), \tag{8}$$

$$\phantom{T} = [P(Y_t = 1 \mid G = f) - P(Y_t = 1 \mid G = r)] - [P(Y_c = 1 \mid G = f) - P(Y_c = 1 \mid G = r)]. \tag{9}$$

It is useful to remember the two ways of writing T in (8) and (9). In (8) T is expressed
as the difference between the ACE in the focal group and the ACE in the reference group. In (9)
T is expressed as the difference between the t and c items in their respective differences in the
probability that Y = 1 between the focal and reference groups. When Y = 1 indicates a correct
answer, the difference in the probability that Y = 1 between the focal and reference groups is
called the impact of the item (Holland, 1985). Thus, in this case T may be viewed as the
difference in the impact of item t and item c.

When T in (8) is positive it means that the change in the probability of the predicted
responses caused by t (relative to c) is larger for the focal group than it is for the reference
group (i.e., the ACE for f is larger than the ACE for r). Typically, this is the type of prediction
made in a DIF hypothesis.

One problem with a parameter like T is that the probability of the predicted behavior
measured by $Y_t$ or $Y_c$ will often differ between the reference and focal group, that is
$P(Y_j = 1 \mid G = r)$ and $P(Y_j = 1 \mid G = f)$ will not be the same. It may also differ between item
t and c, i.e., if there is a "main effect" of items in the pair. When these differences are large,
the interpretation of the magnitude of T is complicated by the boundedness of the probability
scale (i.e., the fact that Y is a 0/1 variable). Consider these four examples in which Y denotes
selection of the correct response for a pair of items, (t, c).

Example A: $P(Y_t = 1 \mid G = r) = P(Y_c = 1 \mid G = r) = P(Y_t = 1 \mid G = f) = .5$,
and $P(Y_c = 1 \mid G = f) = .4$;

then

$$T = (.5 - .4) - (.5 - .5) = .1.$$

The ACE in the reference group is 0 while in the focal group it is .1, so that T is .1. In this
case items c and t are equally difficult for the reference group and t is equal in difficulty for the
reference and focal groups. Furthermore, c is more difficult than t for the focal group. This
is an ideal type of example in which some characteristic of item c causes it to be harder for
members of the focal group and when this is altered to item t the item is equally difficult for
both the reference and focal groups.

Example B: $P(Y_t = 1 \mid G = r) = .55$, $P(Y_c = 1 \mid G = r) = .45$,
$P(Y_t = 1 \mid G = f) = .50$, $P(Y_c = 1 \mid G = f) = .30$;

then the ACE for r is .1 and the ACE for f is .2, so that T = .2 - .1 = .1, again.

This is a more realistic example than Example A because there is both a group difference and
an item difference. Still it is evident in this example that the change from item c to item t has
a bigger average causal effect on members of the focal group than it has for members of the
reference group.

Example C: $P(Y_t = 1 \mid G = r) = .95$, $P(Y_c = 1 \mid G = r) = .85$,
$P(Y_t = 1 \mid G = f) = .70$, $P(Y_c = 1 \mid G = f) = .50$;

then

$$T = (.70 - .50) - (.95 - .85) = .20 - .10 = .1,$$

once more.

The value of T is the same as in Examples A and B but the context is quite different.
Both c and t are much easier for the reference group than they are for the focal group and for
both groups item t is somewhat easier than item c. The ACE for f is .70 - .50 = .20 but the
ACE for r is only .95 - .85 = .10. However, the boundedness of the probability scale makes
it impossible for $P(Y_t = 1 \mid G = r)$ to exceed $P(Y_c = 1 \mid G = r)$ by .20 when the latter
probability is .85, as in this example. Does T = .1 mean that the change from c to t had a
bigger effect for members of f than for members of r, or was c already so easy for members of
r that the change to t could not improve their performance as much as it did for members of f?
This ambiguity stems from the large difference in performance on c and t between the reference
and focal group. The use of covariate scores is aimed at removing some of this confusion, as
we discuss below.



Example D: $P(Y_t = 1 \mid G = r) = .95$, $P(Y_c = 1 \mid G = r) = .60$,
and
$P(Y_t = 1 \mid G = f) = .85$, $P(Y_c = 1 \mid G = f) = .40$.

In this case

$$T = (.85 - .40) - (.95 - .60) = .1$$

as in the other examples. This example is like Example C except that the roles of the groups
and the items have been reversed. In this example, there is a large main effect of items: item
t is much easier for both groups than is item c. The consequence of this large main effect is that
it confuses the interpretation of T. The ACE for f is .85 - .40 = .45 while the ACE for r is .95
- .60 = .35; however, starting with $P(Y_c = 1 \mid G = r) = .60$ it is impossible for the ACE for
r to exceed .40. Again the boundedness of the probability scale is a source of confusion in the
interpretation of T.
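
The arithmetic in Examples A through D can be checked with a short script. This is a minimal
sketch: the probabilities are the ones read from the four examples, while the helper names
(`ace`, `interaction_T`, `ace_ceiling`) are hypothetical and introduced only for illustration.
The ceiling function simply expresses the boundedness argument: a probability cannot exceed 1,
so the ACE in a group cannot exceed one minus that group's probability on item c.

```python
# P(Y_j = 1 | G = g), keyed by (item, group), as read from Examples A-D.
EXAMPLES = {
    "A": {("t", "r"): 0.50, ("c", "r"): 0.50, ("t", "f"): 0.50, ("c", "f"): 0.40},
    "B": {("t", "r"): 0.55, ("c", "r"): 0.45, ("t", "f"): 0.50, ("c", "f"): 0.30},
    "C": {("t", "r"): 0.95, ("c", "r"): 0.85, ("t", "f"): 0.70, ("c", "f"): 0.50},
    "D": {("t", "r"): 0.95, ("c", "r"): 0.60, ("t", "f"): 0.85, ("c", "f"): 0.40},
}

def ace(p, group):
    """Average causal effect of item t relative to item c within one group."""
    return p[("t", group)] - p[("c", group)]

def interaction_T(p):
    """ACE-difference: focal-group ACE minus reference-group ACE."""
    return ace(p, "f") - ace(p, "r")

def ace_ceiling(p, group):
    """Largest ACE the probability scale allows, given P(Y_c = 1 | G = group)."""
    return 1.0 - p[("c", group)]

for name, p in EXAMPLES.items():
    print(name,
          "ACE_r =", round(ace(p, "r"), 2),
          "ACE_f =", round(ace(p, "f"), 2),
          "T =", round(interaction_T(p), 2),
          "ceiling_r =", round(ace_ceiling(p, "r"), 2))
```

All four examples give the same T even though the room left for the reference-group ACE
differs widely, which is the interpretive difficulty the examples are meant to expose.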

The Use of Covariate Scores 

Examples C and D show that the boundedness of the probability scale can confuse the
interpretation of the parameter T when there are large differences between the reference and
focal groups in their probabilities of producing the predicted response for items t and c, or when
there is a large main effect of items. The introduction of a covariate score can help alleviate
this problem when there are large group differences. Large main effects of items are generally
a sign of a poorly designed item pair for a DIF study.

Suppose X is a covariate score in the sense described earlier, i.e., X is measured on each
examinee in the study and is not affected by exposure of the examinee to items t or c. Suppose
further that examinees with the same X-score have similar probabilities of making the predicted
response to items t and c regardless of whether they are in the reference or focal group. That
is, suppose that $P(Y_t = 1 \mid G = r, X = x)$ and $P(Y_t = 1 \mid G = f, X = x)$ are similar in value
and that $P(Y_c = 1 \mid G = r, X = x)$ and $P(Y_c = 1 \mid G = f, X = x)$ are also similar in value.
This latter assumption is what we mean by
a useful covariate score. If the predicted behavior is choosing the correct response to items t
and c then candidates for useful covariate scores are number right or formula scores based on
sets of items that measure the same ability that is measured by items t and c. When the
predicted behavior is choosing a particular distractor or omitting the item, number right or
formula scores on other items may not produce a sufficiently useful covariate score and it may
be necessary to augment the test score with other variables, or to define scores based on special
choices of distractors.

When X is a covariate score we can examine a third parameter based on the ACEs in (7).
Define T(x) by

$$T(x) = E(Y_t - Y_c \mid G = f, X = x) - E(Y_t - Y_c \mid G = r, X = x), \tag{10}$$

$$\phantom{T(x)} = [P(Y_t = 1 \mid G = f, X = x) - P(Y_c = 1 \mid G = f, X = x)] - [P(Y_t = 1 \mid G = r, X = x) - P(Y_c = 1 \mid G = r, X = x)]. \tag{11}$$

The causal parameter, T(x), is an interaction like T but is conditional on each X-score. When
X is a useful covariate score and the main effect (1) is small the four probabilities in (11) will
be similar and the boundedness of the probability scale will not confuse the interpretation of T(x)
to the degree that it can for T.






Even though T(x) can help clarify the results of a comparison of responses to t and c for
members of the reference and focal groups, it does introduce the added complexity of a whole
set of parameter values, one for each value of X, rather than just a single value. When X is a
univariate score this plethora of parameters can be handled by a graph of T(x) versus x. When
X is a multivariate set of covariate scores this solution is not as helpful.

One way around this plethora of parameters is to average T(x) over some distribution of
X-values, w(x), where $w(x) \ge 0$ and $\sum_x w(x) = 1$. This results in a new parameter $T_w$ defined by

$$T_w = \sum_x T(x)\, w(x) \tag{12}$$

$$\phantom{T_w} = \sum_x [P(Y_t = 1 \mid G = f, X = x) - P(Y_t = 1 \mid G = r, X = x)]\, w(x) - \sum_x [P(Y_c = 1 \mid G = f, X = x) - P(Y_c = 1 \mid G = r, X = x)]\, w(x). \tag{13}$$

The choice of w(x) matters, and is somewhat arbitrary. In the standardization DIF procedure
(Dorans & Holland, in press), the distribution of X in the focal group is often used as weights,
i.e.,

$$w(x) = P(X = x \mid G = f).$$

This leads to the parameter that we denote by $T_f$ given by

$$T_f = \sum_x T(x)\, P(X = x \mid G = f) \tag{14}$$

$$\phantom{T_f} = \sum_x [P(Y_t = 1 \mid G = f, X = x) - P(Y_t = 1 \mid G = r, X = x)]\, P(X = x \mid G = f) - \sum_x [P(Y_c = 1 \mid G = f, X = x) - P(Y_c = 1 \mid G = r, X = x)]\, P(X = x \mid G = f). \tag{15}$$
If we let

$$\Delta_t = \sum_x [P(Y_t = 1 \mid G = f, X = x) - P(Y_t = 1 \mid G = r, X = x)]\, P(X = x \mid G = f) \tag{16}$$

and

$$\Delta_c = \sum_x [P(Y_c = 1 \mid G = f, X = x) - P(Y_c = 1 \mid G = r, X = x)]\, P(X = x \mid G = f) \tag{17}$$

then

$$T_f = \Delta_t - \Delta_c. \tag{18}$$

In the case where X is a number right or formula score and the predicted behavior is selecting
the correct response for items t and c, $\Delta_t$ and $\Delta_c$ are the parameters estimated by the
standardization DIF procedure. Hence, $T_f$ may be interpreted as the difference between
standardization DIF parameters for items t and c.
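
The relationship between the weighted average of T(x) and the two standardization-type
differences can be illustrated numerically. The sketch below uses made-up conditional
probabilities and score distributions at three score levels; none of these numbers come from the
study, and the names `T_of_x`, `delta`, and `T_w` are hypothetical helpers.

```python
# Hypothetical P(Y_j = 1 | G = g, X = x), keyed by (item, group) and then by score level x.
P = {
    ("t", "f"): {1: 0.30, 2: 0.55, 3: 0.80},
    ("t", "r"): {1: 0.25, 2: 0.50, 3: 0.78},
    ("c", "f"): {1: 0.15, 2: 0.40, 3: 0.70},
    ("c", "r"): {1: 0.25, 2: 0.50, 3: 0.78},
}
# Hypothetical score distributions used as weights w(x).
W_FOCAL = {1: 0.5, 2: 0.3, 3: 0.2}      # stands in for P(X = x | G = f)
W_REFERENCE = {1: 0.2, 2: 0.3, 3: 0.5}  # an alternative choice of weights

def T_of_x(x):
    """Conditional ACE-difference at score level x: focal ACE minus reference ACE."""
    return (P[("t", "f")][x] - P[("c", "f")][x]) - (P[("t", "r")][x] - P[("c", "r")][x])

def delta(item, w):
    """Weighted focal-minus-reference difference for one item."""
    return sum((P[(item, "f")][x] - P[(item, "r")][x]) * w[x] for x in w)

def T_w(w):
    """Weighted average of T(x) over the score distribution w."""
    return sum(T_of_x(x) * w[x] for x in w)

print(round(T_w(W_FOCAL), 3), round(delta("t", W_FOCAL) - delta("c", W_FOCAL), 3))
print(round(T_w(W_REFERENCE), 3))
```

With focal-group weights the two routes give the same value, while switching to the alternative
weights changes the summary, which is the sense in which the choice of w(x) matters.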

At this point, it is worth stopping for a moment and asking why we pay so much
attention to the ACE parameters given in (1) - (4). After all, in computing a DIF measure for
an item we compare the performance of matched focal and reference group members on the
studied item and this is not an ACE parameter. To make the comparison sharper, in computing
a DIF measure for an item using the standardization methodology the basic parameters are the
differences

$$P(Y_j = 1 \mid G = f, X^* = x) - P(Y_j = 1 \mid G = r, X^* = x) \tag{19}$$

for a fixed item j = t or c and a score X* that includes the score on the studied item.* In
contrast, the corresponding ACE is

$$P(Y_t = 1 \mid G = g, X = x) - P(Y_c = 1 \mid G = g, X = x) \tag{20}$$

where g is a particular group, reference or focal, and X is a covariate score that does not include
the studied items.

The motivation for our emphasis on the ACE parameters is a causal model that underlies
the observations. Consider the joint distribution of the two variables $(Y_t, Y_c)$ over the set of
examinees for which G = g and X = x. Let this (conditional) joint distribution be denoted by

$$p_{uvgx} = P(Y_t = u, Y_c = v \mid G = g, X = x). \tag{21}$$

Thus, for example, $p_{10fx}$ is the probability that a focal group member with covariate score X =
x will give the predicted response if responding to item t but will not give it if responding to
item c. In this sense, then, $p_{10fx}$ is the probability that item t causes the predicted response for
focal group examinees with covariate score X = x. The values of $p_{uvgx}$ are "causal parameters"
in this special, but clear-cut, sense. Notice that

$$P(Y_t = 1 \mid G = g, X = x) = p_{11gx} + p_{10gx}. \tag{22}$$

*See Holland and Thayer (1988) and Donoghue, Holland and Thayer (in press) for a discussion of
why inclusion of the studied item in the matching variable is important for both the Mantel-Haenszel and
the standardization procedures.

Hence, the ACE parameter given in (4) can be expressed in terms of the causal parameters in
(21) in the following way:

$$E(Y_t - Y_c \mid G = g, X = x) = (p_{11gx} + p_{10gx}) - (p_{11gx} + p_{01gx}) = p_{10gx} - p_{01gx}. \tag{23}$$

Finally, this gives us an important formula that relates the conditional-on-X ACE-difference,
T(x), to the causal parameters, i.e.,

$$T(x) = p_{10fx} - p_{01fx} - (p_{10rx} - p_{01rx}). \tag{24}$$

Equation (24) can be used to justify our emphasis on the ACE parameter in the following way.
Suppose item c is just as likely to cause members of f to make the predicted response as it is to
cause members of r to do this for examinees with X = x. This means that

$$p_{01fx} = p_{01rx}. \tag{25}$$

It follows that if (25) holds then

$$T(x) = p_{10fx} - p_{10rx}, \tag{26}$$

so that in this case T(x) is the excess of the probability that t causes the predicted response in
the focal group over this probability in the reference group. Assumptions about the causal
parameters, $p_{uvgx}$, are generally untestable, but, depending on the degree of control exercised in
the design of the (t, c) item pair, some assumptions can be made plausible and then give a direct
causal interpretation to T(x). We emphasize that (25) is not the only type of assumption that can
arise in a randomized DIF study.
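
The algebra linking the joint distribution of the two potential responses to the ACE and to
T(x) can be checked on a small numerical example. The joint probabilities below are invented for
illustration at a single score level x; in a real study they are not identifiable, because no
examinee answers both items of a pair. The function names are hypothetical.

```python
# Hypothetical joint distributions P(Y_t = u, Y_c = v | G = g, X = x), keyed by (u, v).
JOINT = {
    "f": {(1, 1): 0.30, (1, 0): 0.25, (0, 1): 0.05, (0, 0): 0.40},
    "r": {(1, 1): 0.45, (1, 0): 0.10, (0, 1): 0.05, (0, 0): 0.40},
}

def prob_t(p):
    """P(Y_t = 1 | g, x): predicted response to t, whatever the response to c."""
    return p[(1, 1)] + p[(1, 0)]

def prob_c(p):
    """P(Y_c = 1 | g, x): predicted response to c, whatever the response to t."""
    return p[(1, 1)] + p[(0, 1)]

def ace(p):
    """Conditional ACE; algebraically equal to p[(1, 0)] - p[(0, 1)]."""
    return prob_t(p) - prob_c(p)

def T_x(joint):
    """Conditional ACE-difference: focal ACE minus reference ACE."""
    return ace(joint["f"]) - ace(joint["r"])

print(round(ace(JOINT["f"]), 2), round(ace(JOINT["r"]), 2), round(T_x(JOINT), 2))
```

In this example the (0, 1) cell is the same in both groups, so T(x) reduces to the difference
between the two (1, 0) cells, matching the simplification described in the text.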

5. EXAMPLES OF SPECIAL CONFIRMATORY STUDIES 
Randomized DIF studies that grew out of the two examples of observational studies discussed
in Section 3 are described in this section. These studies either constructed items with the
postulated factors or varied the location of the items. Other examples of DIF research
evaluating effects of specially constructed items are Bleistein, Schmitt, and Curley (1990) and
Scheuneman (1984, 1987).



5.1 SYSTEMATIC EVALUATION OF HISPANIC DIF FACTORS



The purpose of the Schmitt et al. (1988) investigation was to provide a follow-up to the
Schmitt (1985, 1988) studies through analysis of specially constructed SAT-V items in which
the occurrence of postulated factors (true cognates, false cognates, homographs, and special
interest) was rigorously controlled and manipulated. Two parallel 40-item non-operational
sections were constructed so that each item in one form is a revised but very similar version of
the same position item in the other form. The standardization method was used to compare the
performance of the White reference group and each Hispanic focal group for each item in each
of the two special forms. An external matching criterion was used, the 85-item SAT-V
operational examination taken in the same booklet as the specially constructed section under
study. Hence, the studied items were not included in the matching criterion, which is the
appropriate course of action for a randomized DIF study. Refer to Section 4.2 for an
explanation of why a studied item is not included as part of the matching criterion or covariate.
Estimates of DIF were corrected for speededness by including only those examinees who
reached the item in its calculation. In addition to calculating DIF values for the key, differences
in the standardized proportion of responses for each distractor were computed and evaluated in
order to further understand the effects of the hypothesized factors on Hispanic DIF. Empirical-option
response curves and conditional differential response-rate plots were also evaluated for
each item comparison.

Comparison of the DIF value obtained for one item version versus the DIF value obtained
for the other item version indicated whether or not the postulated factor effect was supported.
The most convincing support was found for the hypothesis that true cognates facilitate
the performance of Hispanic examinees. Striking effects were found for two antonym item pairs
where the true cognates produced positive DIF values that exceeded 10% for nearly all Hispanic
subgroups while the DIF value for the alternate neutral item indicated that the Hispanic groups
performed slightly worse than the reference White group. The PALLID/ASHEN item pair (#7)
presented in Section 4.1 was one of these two antonym item pairs. Figure 1 presents differences
in standardization DIF values between the item pairs testing the true cognate factors for the total
Hispanic group. Confidence bands are drawn on this figure to indicate that differences greater
than 3% between the DIF values of the item pairs are statistically significant. Although only
the two antonym item pairs had differences (.17 and .15 for all Hispanics) that fell outside the
confidence band for all the Hispanic groups, some of the other item pairs had differences in
the postulated direction.



Insert Figure 1 about here 



Comparison of the true cognates with differences in the postulated direction versus those with
no apparent DIF effect indicates that the true cognates that consistently made the items
differentially easier for Hispanics were words with a higher frequency of usage in the Spanish
language. Because of these results, the true cognate hypothesis was revised to restrict the
positive effect of true cognates to true cognates with a higher usage in the Spanish language than
their usage in the English language (Schmitt & Dorans, in press). Since mixed or marginal
results were found for the other hypothesized factors the authors counseled:

More research is needed before prescriptive or proscriptive rules
can be devised to guide item writers. The true cognate items
demonstrate clearly, however, that DIF can be manipulated, at
least some of the time. (Schmitt et al., 1988, p. 20)

5.2 USING LOGISTIC REGRESSION TO ESTIMATE EFFECT SIZES 

Section 4.2 discussed the parameters of interest in a randomized DIF study at the
population level but did not discuss the details of how to estimate them. We now consider the
problem of estimation. There are two parts to this discussion. The first concerns how random
assignment of the special items to examinees allows the basic probabilities in (5) - (7) to be
estimated from the data collected in a randomized DIF study. The second concerns how to use
modern discrete data models to estimate the causal parameters of interest. We discuss each in
turn. We use the data from the randomized DIF study for Hispanics, described in Section 5.1,
to illustrate how the procedure is used and to compare its estimates of effect size to those
produced by standardization.

Randomization and the Causal Parameters 

To reiterate, the various ACEs defined in (1) - (4) and the ACE-difference or interaction
parameters, T and T(x), defined in (8) and (10) are based on these probabilities:

$$P(Y_j = 1), \quad P(Y_j = 1 \mid G = g) \quad \text{and} \quad P(Y_j = 1 \mid G = g, X = x) \tag{27}$$

for j = c, t and g = f, r.

However, the data obtained in a randomized DIF study are $Y_S$, S, G and X on each
examinee. Hence the parameters that can be directly estimated in a randomized DIF study are
not those in (27), but are, instead,

$$P(Y_S = 1 \mid S = j), \quad P(Y_S = 1 \mid S = j, G = g) \quad \text{and} \quad P(Y_S = 1 \mid S = j, G = g, X = x), \tag{28}$$

which can also be expressed as

$$P(Y_j = 1 \mid S = j), \quad P(Y_j = 1 \mid S = j, G = g) \quad \text{and} \quad P(Y_j = 1 \mid S = j, G = g, X = x). \tag{29}$$

(Note that in (28) and (29) we have made use of the fact that X is a covariate score; otherwise
it would be subscripted by j.)

The role of random assignment of examinees to item t or c is that it makes the variable
S statistically independent of $Y_t$, $Y_c$, G and X. Hence, randomization results in the probabilities
in (29) being respectively equal to those in (27) that underlie the ACEs and ACE-differences of
interest to us. Thus, we may use estimates of the probabilities in (28) as the basis of our
inferences about the causal parameters T, T(x), and $T_f$. If random assignment fails for some reason
then this is not true. There are a variety of ways that random assignment can fail to be executed
in any randomized study. An important class of such failure is "differential drop-out" between
the units assigned to each condition. In randomized DIF studies "drop-out" means that the
examinee does not attempt to answer the special test items. Differential drop-out might occur
between examinees assigned to item t and to item c if the location of these items in the overall
test form is very different, i.e., t is the first item in its test form but c is the last item of its test
form.
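
The effect of random assignment can be illustrated with a small simulation. This is a sketch
under stated assumptions: the response probabilities, the focal-group share, and the sample size
are invented, and the functions `simulate` and `compare` are hypothetical helpers. Each simulated
examinee carries both potential responses, but only the response to the randomly assigned item is
treated as observed.

```python
import random

# Hypothetical response probabilities, chosen only for illustration.
P_T = {"r": 0.55, "f": 0.50}  # P(Y_t = 1 | G = g)
P_C = {"r": 0.45, "f": 0.30}  # P(Y_c = 1 | G = g)

def simulate(n=100_000, focal_share=0.2, seed=0):
    """Return one (g, s, y_t, y_c, y_s) tuple per simulated examinee."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = "f" if rng.random() < focal_share else "r"
        y_t = int(rng.random() < P_T[g])        # potential response to item t
        y_c = int(rng.random() < P_C[g])        # potential response to item c
        s = "t" if rng.random() < 0.5 else "c"  # random assignment, independent of the rest
        y_s = y_t if s == "t" else y_c          # the only response a real study records
        rows.append((g, s, y_t, y_c, y_s))
    return rows

def compare(rows):
    """Map (item, group) to (mean of Y_j over the whole group, mean of Y_S among S = j)."""
    out = {}
    for g in ("r", "f"):
        for j in ("t", "c"):
            idx = 2 if j == "t" else 3
            everyone = [row[idx] for row in rows if row[0] == g]
            assigned = [row[4] for row in rows if row[0] == g and row[1] == j]
            out[(j, g)] = (sum(everyone) / len(everyone), sum(assigned) / len(assigned))
    return out

for key, (full, observed) in compare(simulate()).items():
    print(key, round(full, 3), round(observed, 3))
```

The first number in each pair uses both potential responses and so is unavailable in practice;
the second uses only observed data. Under random assignment the two agree up to sampling error.
If the assignment line were made to depend on `y_t` or `y_c`, the agreement would generally break
down, which is the kind of failure the drop-out discussion warns about.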

Estimating the Main Effect Parameter 

Useful estimation strategies always depend on the type and extent of the available data.
We will describe an approach, based on logistic regression, that can be used in a variety of
situations. The main effect parameter

$$E(Y_t - Y_c) = P(Y_t = 1) - P(Y_c = 1)$$

can also be expressed, by the argument given above, as the treatment-control difference,

$$P(Y_S = 1 \mid S = t) - P(Y_S = 1 \mid S = c), \tag{30}$$

in a randomized DIF study. Let $\hat{p}_t$ denote the proportion of examinees who made the predicted
response among those asked to respond to item t and let $\hat{p}_c$ be similarly defined. Then
the difference, $\hat{p}_t - \hat{p}_c$, estimates the difference in (30). For example, consider the
PALLID/ASHEN item discussed earlier. A sample of 42,033 White or Hispanic examinees
answered the PALLID item (t) and 45,960 White or Hispanic examinees answered the ASHEN
item (c). The proportions answering the two items correctly are, respectively, .51 and .50. The
estimate of the main effect of items is the difference, .01. Thus, we see that, in fact, the two
items are nearly of equal difficulty, over the subpopulation consisting of proportional
representations of self-identified White and Hispanic examinees. In this sample there were
84,852 White examinees and 3,141 Hispanic examinees.

It is useful to set up our notation for logistic regression now so that we can show its
relationship to the main-effect parameter (30). Let S* and G* be indicator variables defined by

$$S^* = \begin{cases} 1 & \text{if } S = t, \\ 0 & \text{if } S = c, \end{cases} \qquad G^* = \begin{cases} 1 & \text{if } G = f, \\ 0 & \text{if } G = r. \end{cases} \tag{31}$$

We set up a logistic regression model of the following fairly general form:

$$\text{logit}[P(Y_S = 1 \mid S, G, X)] = \sum_{k=0}^{n} \alpha_k X^k + \sum_{k=0}^{n} \beta_k X^k S^* + \sum_{k=0}^{n} \gamma_k X^k G^* + \sum_{k=0}^{n} \lambda_k X^k S^* G^*, \tag{32}$$

where

$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$

and $\alpha_k$, $\beta_k$, $\gamma_k$ and $\lambda_k$ are the model parameters.

In (32) the logit is assumed to be a polynomial of degree at most n in the covariate score
X and this polynomial is possibly different for each of the four combinations of S and G.
Simplification of this general model is achieved by data analysis in which various submodels of
(32) are examined. Polynomials in X of degree 2 or more may be used to allow for curvilinear
logit functions. For example, in the PALLID/ASHEN example the following logistic regression
models were found to give satisfactory fits to the data in which X is the operational SAT-V score
that does not include the studied item.

ASHEN, White examinees:

logit $P(Y_S = 1 \mid S = c, G = r, \text{SAT-V})$
= -1.970 - 0.458 (SAT-V) + 0.581 (SAT-V)$^2$

PALLID, White examinees:

logit $P(Y_S = 1 \mid S = t, G = r, \text{SAT-V})$
= -3.885 + 3.081 (SAT-V) - 1.184 (SAT-V)$^2$ + 0.255 (SAT-V)$^3$

ASHEN, Hispanic examinees:

logit $P(Y_S = 1 \mid S = c, G = f, \text{SAT-V})$
= -0.621 - 1.907 (SAT-V) + 0.917 (SAT-V)$^2$

PALLID, Hispanic examinees:

logit $P(Y_S = 1 \mid S = t, G = f, \text{SAT-V})$
= -1.194 + 0.174 (SAT-V) + 0.273 (SAT-V)$^2$

Let $\hat{p}(j, g, x)$ denote the estimated conditional probability (or fitted probability) that
results from the logistic regression analysis. The fitted probabilities are related to the estimated
logits in (32) according to the following formula. If

$$L(j, g, x) = \text{estimated logit}[P(Y_S = 1 \mid S = j, G = g, X = x)]$$

then

$$\hat{p}(j, g, x) = \exp(L(j, g, x)) / (1 + \exp(L(j, g, x))).$$

The four fitted probability functions for the estimated logits given above are displayed in Figure
2. We see that the predicted probabilities for the PALLID item for the Hispanic group are quite
different from the other three.
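
The conversion from an estimated polynomial logit to a fitted probability can be sketched as
follows. The coefficients are those read from the four fitted models above, with the powers of
SAT-V taken to run 1, 2, 3 in the order the terms appear; this ordering, and the scale on which
SAT-V was entered in the regression, are assumptions, since the source does not state the scale.
The function names are hypothetical, and the numeric outputs should not be over-interpreted.

```python
import math

# Coefficients (constant term first), keyed by (item, group), as read from the fitted models.
FITTED_LOGITS = {
    ("c", "r"): (-1.970, -0.458, 0.581),          # ASHEN, White examinees
    ("t", "r"): (-3.885, 3.081, -1.184, 0.255),   # PALLID, White examinees
    ("c", "f"): (-0.621, -1.907, 0.917),          # ASHEN, Hispanic examinees
    ("t", "f"): (-1.194, 0.174, 0.273),           # PALLID, Hispanic examinees
}

def estimated_logit(j, g, x):
    """Evaluate the fitted polynomial logit for item j and group g at score x."""
    return sum(b * x ** k for k, b in enumerate(FITTED_LOGITS[(j, g)]))

def fitted_probability(j, g, x):
    """Inverse logit; algebraically the same as exp(L) / (1 + exp(L))."""
    return 1.0 / (1.0 + math.exp(-estimated_logit(j, g, x)))

# Evaluate all four curves at an arbitrary score value (x = 1.0 is only a placeholder).
for key in FITTED_LOGITS:
    print(key, round(fitted_probability(*key, 1.0), 3))
```

Evaluating these functions over a grid of scores on the appropriate scale would trace out four
curves of the kind a fitted-probability figure displays.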



Insert Figure 2 about here

Once a satisfactory logistic regression model is selected, we may use it to obtain various
covariate adjusted estimates. The fitted probabilities, $\hat{p}(j, g, x)$, are estimates of the conditional
probability,

$$p(j, g, x) = P(Y_S = 1 \mid S = j, G = g, X = x). \tag{33}$$

Define $\tilde{p}_j$ by

$$\tilde{p}_j = \sum_{g,x} \hat{p}(j, g, x)\, n_{jgx} \Big/ \sum_{g,x} n_{jgx}, \tag{34}$$

where $n_{jgx}$ is the number of examinees in the study with S = j, G = g and X = x.
Thus, $\tilde{p}_j$ may be viewed as an estimate of $P_j$, the proportion of examinees in the population who
give the predicted response to item j in the pair (t, c), that is based on the smoothed predicted
probabilities, $\hat{p}(j, g, x)$. However, if the submodel of (32) that is selected to represent the data
contains $\alpha_0$ and $\beta_0$ as free parameters it may be shown that $\tilde{p}_j$ and the raw proportion, $\hat{p}_j$, are
equal. Because we allowed $\alpha_0$ and $\beta_0$ to be free in our analysis, covariance adjustment does not
change our estimate of the main effect parameter.

Estimating T 

The interaction parameter T defined in (8) can be estimated directly or by the use of 
covariate adjustments. Let $\hat{P}_{jg}$ denote the proportion of examinees making the predicted response 
among all those exposed to item j = t or c in group g (g = f, r). The argument given in the 
earlier section shows that the difference of sample differences in proportions, 

$$\hat{T} = (\hat{P}_{tf} - \hat{P}_{cf}) - (\hat{P}_{tr} - \hat{P}_{cr}), \qquad (35)$$

estimates the ACE-difference, T. In the PALLID/ASHEN example the four proportions that 
make up $\hat{T}$ are given below. 

                      (t)        (c) 
                      PALLID     ASHEN 
Whites (r)            .51        .50 
Hispanics (f)         .56        .36          (36) 

The value of $\hat{T}$ is therefore 

$$\hat{T} = (.56 - .36) - (.51 - .50) = .19.$$
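
The arithmetic behind this direct estimate is a difference of differences in proportions. A minimal sketch, using the four proportions reported for the PALLID/ASHEN example (the function name is hypothetical):

```python
def interaction_estimate(p_tf, p_cf, p_tr, p_cr):
    """Difference of differences: (focal t - focal c) - (reference t - reference c)."""
    return (p_tf - p_cf) - (p_tr - p_cr)


# Proportions from the PALLID (t) / ASHEN (c) example:
# Hispanics (f): .56 and .36; Whites (r): .51 and .50.
t_hat = interaction_estimate(p_tf=0.56, p_cf=0.36, p_tr=0.51, p_cr=0.50)
print(round(t_hat, 2))  # 0.19
```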



A covariate adjusted estimate of T can also be obtained from the fitted probabilities 
resulting from a logistic regression analysis. Let $\tilde{p}_{jg}$ be defined by 

$$\tilde{p}_{jg} = \sum_{x} \hat{p}(j, g, x)\, n_{jgx} \Big/ \sum_{x} n_{jgx}, \qquad (37)$$

where $\hat{p}(j, g, x)$ and $n_{jgx}$ are as defined earlier. Then the covariate adjusted estimate of T is 

$$\tilde{T} = \tilde{p}_{tf} - \tilde{p}_{cf} - (\tilde{p}_{tr} - \tilde{p}_{cr}). \qquad (38)$$

If the submodel of (32) that is selected to represent the data contains $\alpha_0$, $\beta_0$, $\gamma_0$ and $\lambda_0$ as free 
parameters then it may be shown that $\tilde{p}_{jg}$ and $\hat{P}_{jg}$ are equal. Because we have done this in the 
models fit to the data in the PALLID/ASHEN example, our estimates, $\hat{T}$ and $\tilde{T}$, are equal. 
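
The equality of adjusted and raw proportions follows from the maximum likelihood score equation for a free intercept, which forces the fitted probabilities to sum to the observed count of predicted responses. The sketch below illustrates this for a single item-by-group cell with one covariate; the data are made up for illustration, and the hand-written Newton-Raphson fit is a stand-in for the packaged routines mentioned later in the text, not the authors' procedure.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson ML fit of logit P(y = 1) = a + b * x, with a free intercept a."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + math.exp(-(a + b * x))) for x in xs]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Information matrix entries (negative Hessian).
        h00 = sum(p * (1.0 - p) for p in ps)
        h01 = sum(p * (1.0 - p) * x for x, p in zip(xs, ps))
        h11 = sum(p * (1.0 - p) * x * x for x, p in zip(xs, ps))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical covariate scores and 0/1 responses for one cell.
xs = [-2, -1, -1, 0, 0, 1, 1, 2]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_logistic(xs, ys)
fitted = [1.0 / (1.0 + math.exp(-(a + b * x))) for x in xs]

raw_proportion = sum(ys) / len(ys)               # analogue of the raw proportion
adjusted_proportion = sum(fitted) / len(fitted)  # analogue of the adjusted proportion
print(raw_proportion, adjusted_proportion)
```

Up to numerical tolerance the two printed values coincide, which is the single-cell version of the statement that the adjusted and raw estimates of T are equal when the intercept-type parameters are free.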

Estimating T(x) 

When sample sizes are very large, a useful direct estimate of T(x) is available. In 
analogy with (35) it is 

$$\hat{T}(x) = \hat{P}_{tfx} - \hat{P}_{cfx} - (\hat{P}_{trx} - \hat{P}_{crx}) \qquad (39)$$

where $\hat{P}_{jgx}$ is the proportion of examinees who made the predicted response among all those for 
whom S = j, G = g and X = x. However, in practice, where samples are often small, (39) 
yields very noisy estimates of T(x) that can mask trends. Instead, a more useful approach is to 
use the fitted probabilities from the logistic regression analysis, $\hat{p}(j, g, x)$. This yields 

$$\tilde{T}(x) = \hat{p}(t, f, x) - \hat{p}(c, f, x) - (\hat{p}(t, r, x) - \hat{p}(c, r, x)). \qquad (40)$$

When x is a univariate score, a graph of $\tilde{T}(x)$ versus x is a useful summary of the results for 
items t and c of the randomized DIF study. Figure 3 shows a plot of $\tilde{T}(x)$ versus x for the 
PALLID/ASHEN example in which the covariate is the SAT-V score. 
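
A minimal sketch of the smoothed estimate in (40), built from the four fitted logit polynomials reported for the PALLID/ASHEN example. As assumptions: the polynomial powers are read as 2 and 3, the scale of the covariate is whatever the original fit used (it is not stated in this section), and the grid points below are arbitrary illustrative values rather than actual score levels.

```python
import math

# (item, group) -> (constant, linear, quadratic, cubic) logit coefficients.
# t = PALLID, c = ASHEN, r = White (reference), f = Hispanic (focal).
COEFS = {
    ("t", "r"): (-3.885, 3.081, -1.184, 0.255),
    ("c", "r"): (-1.970, -0.458, 0.581, 0.0),
    ("t", "f"): (-1.194, 0.174, 0.273, 0.0),
    ("c", "f"): (-0.621, -1.907, 0.917, 0.0),
}


def p_hat(j, g, x):
    """Fitted probability for item j and group g at covariate value x."""
    logit = sum(c * x ** k for k, c in enumerate(COEFS[(j, g)]))
    return 1.0 / (1.0 + math.exp(-logit))


def t_tilde(x):
    """Equation (40): focal-group item difference minus reference-group item difference."""
    return (p_hat("t", "f", x) - p_hat("c", "f", x)) - (
        p_hat("t", "r", x) - p_hat("c", "r", x)
    )


# Arbitrary illustrative covariate values on the (unstated) fitting scale.
for x in (1.5, 2.0, 2.5):
    print(x, round(t_tilde(x), 3))
```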

Insert Figure 3 about here 



Estimates of $T_w$ are easily derived from (40) via the formula 

$$\tilde{T}_w = \sum_{x} \tilde{T}(x)\, w(x) \qquad (41)$$

for any set of weights w(x). In particular, when 

$$w(x) = n_{fx} / n_f \qquad (42)$$

we obtain an estimate of $T_f$, T(x) weighted by the distribution of X in the focal group. It is 

$$\tilde{T}_f = \sum_{x} \tilde{T}(x)\, n_{fx} / n_f. \qquad (43)$$



For the PALLID/ASHEN example $\tilde{T}_f$ is .17, which agrees with the difference in standardization 
parameters reported in Schmitt et al. (1988). This agreement is due to several factors. Most 
importantly, the sample sizes for the Hispanic groups who responded to the t and c items were 
sufficiently large (1,619 and 1,522, respectively) that the distributions of the covariate scores in 
these two groups were similar to the distribution obtained by pooling them. In addition, the 
curves reported in Figure 2 are the result of careful data analysis and represent the noisy raw 
proportions in the data very well. Finally, the estimate of the standardization parameter is based 
on an external matching criterion that is the same as that used in the logistic regression analyses 
reported here, the SAT-V score. We note that the use of an external matching criterion that 
does not include the studied item is generally not an appropriate procedure for measuring the 
amount of DIF exhibited in an item, but in this case it is appropriate since the parameter of 
ultimate interest is the average interaction parameter, $T_f$, given in (18), rather than the DIF 
values above. 
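
The single-number summary obtained by weighting T(x) by the focal group's covariate distribution can be sketched as below. The T(x) values and focal-group counts are invented purely to show the computation; they are not data from the study, and the function name is hypothetical.

```python
def weighted_interaction(t_by_x, focal_counts):
    """Weighted average of T(x) with weights w(x) = n_fx / n_f."""
    n_f = sum(focal_counts.values())
    return sum(t_by_x[x] * n_fx / n_f for x, n_fx in focal_counts.items())


# Hypothetical smoothed T(x) values at three covariate levels and the
# hypothetical number of focal-group examinees at each level.
t_by_x = {1: 0.10, 2: 0.20, 3: 0.30}
focal_counts = {1: 40, 2: 40, 3: 20}

print(weighted_interaction(t_by_x, focal_counts))
```

Because the weights sum to one, the summary always lies between the smallest and largest values of T(x); other choices of w(x) give other members of the same family of summaries.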

SUMMARY OF STEPS FOR USING LOGISTIC REGRESSION 
IN A RANDOMIZED DIF STUDY 

The theory and practice of logistic regression are now fairly well established. The 
discussions in Cox (1970) and in Hosmer and Lemeshow (1989) are very helpful, and software 
is available in the SAS, SPSS and BMDP packages. We suggest the following checklist for the 
use of logistic regression in the analysis of data from randomized DIF studies: 



• Be sure that the variables used as covariate scores are, in fact, covariates, i.e., they are 
unaffected by whether the examinee was exposed to the t or c item. 

• Consider including as many covariate scores as possible in the analysis, e.g., math as 
well as verbal scores, or subscores such as rights and omits on formula-scored tests. 

• Consider including powers higher than linear or quadratic terms in order to improve the 
fit of the model. 

• Start with a large model like that in (32), and then simplify it to the point where there 
are as few parameters as possible without a degradation in the fit. 

• Check the fit of the model in at least the following two ways: 

a) See if the inclusion of a term in the model adds substantially to its fit as measured 
by the standard one-degree-of-freedom likelihood ratio test. 

b) Plot the fitted proportions from the model along with the observed proportions as 
functions of the covariate scores for each combination of group (f or r) and item 
(t or c). The fitted proportions should go through the middle of the scatter of 
observed proportions. 

• In addition, check residuals from the model for outliers, and remove them to see if they are 
responsible for unusual features of the resulting model. 

• Remember that the point of this careful data analysis is to find a smooth function of the 
covariate score(s) that adequately smoothes the noisy observed proportions, $\hat{P}_{jgx}$. 

• Use the fitted proportions, $\hat{p}(j, g, x)$, to compute $\tilde{T}(x)$ and single number summaries like $\tilde{T}_f$. 

• Do not concentrate on interpreting the coefficients of the finally selected logistic 
regression model, i.e., $\alpha_k$, $\beta_k$, $\gamma_k$, or $\lambda_k$, because these are in the logit scale. Rather, 
compare the four functions $\hat{p}(t, r, x)$, $\hat{p}(c, r, x)$, $\hat{p}(t, f, x)$ and $\hat{p}(c, f, x)$ via plots, as in 
Figure 2, and interpret these differences. 
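
The one-degree-of-freedom likelihood ratio check in step (a) of the checklist can be sketched as follows. The helper names are hypothetical, and the 5% significance level (critical value 3.841 for a chi-square with one degree of freedom) is an assumed convention rather than one prescribed in the text.

```python
import math

CHI2_1DF_CRITICAL_05 = 3.841  # upper 5% point of chi-square with 1 df (assumed level)


def log_likelihood(ys, ps):
    """Bernoulli log-likelihood of 0/1 responses ys given fitted probabilities ps."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y, p in zip(ys, ps))


def lr_statistic(loglik_smaller, loglik_larger):
    """Likelihood ratio statistic comparing nested models that differ by one term."""
    return 2.0 * (loglik_larger - loglik_smaller)


def term_adds_to_fit(loglik_smaller, loglik_larger, critical=CHI2_1DF_CRITICAL_05):
    """True when the extra term improves the fit beyond the chosen critical value."""
    return lr_statistic(loglik_smaller, loglik_larger) > critical
```

For example, maximized log-likelihoods of -100.0 without the term and -97.0 with it give a statistic of 6.0, which exceeds 3.841, so the term would be retained under this convention.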

5.3 DIFFERENTIAL SPEEDEDNESS ASSESSED UNDER CONTROLLED CONDITIONS 

An example of a special confirmatory study where the location of the items was varied 
is the Dorans, Schmitt and Curley (1988) study. This study examined directly how differential 
speededness affects the magnitude of DIF. In addition, it evaluated how well the procedure of 
excluding not reached examinees from the calculation of the DIF statistic adjusts for the effects 
of differential speededness. The purpose of the study was to answer two questions: 

• Does an item's DIF value depend upon its location in the test? 

• If so, can the item location effect be removed via a statistical adjustment of the DIF 
statistic? 

For a detailed description of the study see Dorans, Schmitt and Curley (1988). 

For the purposes of their study, one non-operational 45-item and one non-operational 
40-item SAT-Verbal pretest were labelled "Form A" and "Form B", respectively. The ten 
analogy items appearing in Form A in positions 36 to 45 were combined with the antonyms, 
sentence completions, and reading comprehension items from Form B to create "Form C", a 
40-item section in which the ten analogies appeared in positions 16 to 25. Similarly, the analogy 
items from Form B in positions 16 to 25 were combined with the antonyms, sentence 
completions, and reading comprehension items from Form A to create "Form D", a 45-item test 
in which the ten analogies were shifted to the end of the section in positions 36 to 45. 

This particular design afforded an opportunity to examine how differential speededness 
for Blacks affects the magnitude of DIF statistics on two sets of analogy items. As a control 
analysis, Dorans, Schmitt and Curley (1988) also conducted differential speededness and DIF 
analyses for females on the same sets of analogy items. Standardized distractor analyses (Dorans 
& Holland, in press) that focused on not reached were used to assess differential speededness. 

Figures 4 and 5 depict the degree of differential speededness observed for Blacks and 
females on the Form A and Form D analogy items, respectively. In these figures, the STD 
P-DIF(NR) values, in percentage units, are plotted against item number. Absolute values of 5% 
or greater indicate a sizeable degree of differential speededness. A positive STD P-DIF(NR) 
value means that the focal group, Blacks or females, is not reaching the item to the degree that 
the base or reference group, Whites or males, is. Conversely, a negative STD P-DIF(NR) value 
means the focal group is reaching the item in greater proportions than the matched base group. 
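
A sketch of how a STD P-DIF(NR) value and the 5% flag might be computed. It assumes the usual standardization form, in which differences between focal and reference not-reached rates at each matched score level are weighted by the focal group's counts; the paper's own definition lies outside this section, and the counts below are invented for illustration.

```python
def std_p_dif_nr(strata):
    """Standardized not-reached difference, in percentage units.

    strata: iterable of (n_focal, nr_focal, n_ref, nr_ref) per matched score level,
    where nr_* is the number of examinees who did not reach the item.
    Levels with no focal or no reference examinees are skipped.
    """
    weighted_diff = 0.0
    total_weight = 0
    for n_focal, nr_focal, n_ref, nr_ref in strata:
        if n_focal == 0 or n_ref == 0:
            continue
        # Weight the rate difference by the focal-group count at this level.
        weighted_diff += n_focal * (nr_focal / n_focal - nr_ref / n_ref)
        total_weight += n_focal
    return 100.0 * weighted_diff / total_weight


def is_sizeable(value, cutoff=5.0):
    """Flag absolute values at or above the 5% cutoff described in the text."""
    return abs(value) >= cutoff


# Hypothetical counts at two matched score levels.
example = [(100, 20, 200, 20), (100, 10, 200, 10)]
value = std_p_dif_nr(example)
print(value, is_sizeable(value))
```

With these made-up counts the focal group fails to reach the item more often at both levels, so the value is positive, matching the sign convention described above.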



Insert Figures 4 and 5 about here 



In Figure 4, there is little evidence of differential speededness for females. For Blacks, 
there is some evidence, particularly on items 42 and 43, and possibly 40 and 41. In Figure 5, 
for females, item 44 is approaching the 5% cutoff. For Blacks, differential speededness is quite 
pronounced. Items 41, 42, 43, and 44 are at or above the 5% value, while items 38, 39, and 
40 are approaching the 5% value. Note that across Figures 4 and 5 all but one item has a 
positive STD P-DIF(NR) value for Blacks, indicating that Blacks reach items at the end of the 
45-item Verbal 1 section at a slower rate than a matched group of Whites, as reported by 
Dorans, Schmitt and Bleistein (1988). In contrast, the STD P-DIF(NR) values for females are 
either at 0 (9 of 20 items) or slightly negative, indicating that females get further into the test 
than a matched group of males. 

There are no figures for the analogy items on Forms B or C because there is no 
differential speededness on the analogy items in positions 16 to 25 of the 40-item format. In 
fact, all examinees reached these items. 

A major goal of the Dorans, Schmitt and Curley research was to ascertain whether or not 
there was a position effect on DIF statistics. Evidence has been presented for a differential 
speededness effect for Blacks, and of negative DIF, predominantly for Blacks on the earlier, 
easier analogy items. In addition, item position effects were reported. 

Does an item's DIF value depend on its location in the test? Dorans, Schmitt and Curley 
(1988) reported that the answer is yes for some items, particularly when one position is subject 
to a differential speededness effect while the other is not. 

The second question to be addressed by the Dorans, Schmitt and Curley research was: 
Can the item location effect be removed via a statistical adjustment? In particular, does 
exclusion of the candidates who do not reach the item from calculation of the DIF statistic 
produce a statistic that is less sensitive to position? All things considered, the adjustment for 
not reached tended to dampen the position effect for most items. It did not, however, 
completely remove the speededness effect. 

6. SUMMARY 



This paper provided prescriptions for the practice of conducting research into the 
evaluation of DIF hypotheses. Advice was given for both observational studies with operational 
item data and controlled studies with specially constructed items. The following checklist can 
be used to guide the conduct of observational studies. 

SUMMARY OF STEPS IN THE EVALUATION OF DIF HYPOTHESES USING 
OBSERVATIONAL DATA 

• Operationalize the definition of the postulated DIF factors in order to permit the objective 
classification of items. 

• Classify all items in accordance with postulated DIF factors. 

• Define the appropriate focal and reference groups. 

• Select appropriate samples. 

• Determine the matching criterion considering dimensionality, reliability, and criterion 
refinement issues. 

• Determine what statistical adjustments are relevant (e.g., speededness and omission). 

• Select an appropriate DIF estimate based on the above considerations. 

• Calculate DIF statistics for the key, distractors, and response style factors. 

• Evaluate relevant information provided by distractor and difference plots. 

• Summarize DIF information by the postulated factors; use descriptive statistics (e.g., 
correlate comparable DIF outcomes with hypothesized factors using appropriate statistical 
methods). 

• Determine whether the DIF information supports the hypothesized DIF factors. 

Section 4.2 described the rudiments of a theory of causal inference, the success of which 
hinges on putting the J. S. Mill quote from nearly 150 years ago into action by measuring 
causation through experimental manipulation. Section 5.2 describes the specifics of one 
approach towards accomplishing this. The following checklist can be used to guide future 
randomized DIF studies. 

SUMMARY OF STEPS IN THE CONFIRMATORY EVALUATION OF DIF HYPOTHESES 
USING SPECIALLY CONSTRUCTED ITEMS 

• Construct sets of items (treatment and control) in accordance with postulated DIF factors; 
control extraneous factors to the extent possible. 

• Define the focal and reference groups and randomly determine control and treatment 
subgroups. 

• Select appropriate sample sizes; replicate administrations when needed in order to obtain 
sufficient sample sizes. 

• Determine the matching criterion that is a covariate in the sense used here; use an 
external matching criterion when possible. Consider dimensionality and criterion 
refinement issues. 

• Specify what statistical adjustments are relevant (e.g., speededness and omission). 

• Calculate DIF statistics for the key, distractors, and response style factors. 

• Evaluate relevant information provided by distractor and difference plots. 

• Summarize DIF information by the postulated factors; use descriptive and inferential 
statistics. 

• Determine whether the DIF information supports the hypothesized DIF factors. 

These randomized DIF studies are distinguished by the careful construction of hypothesis 
items and their controls, the control of extraneous factors, the use of randomization, and the 
quest for adequate samples to achieve enough statistical power to detect effects related to the 
DIF hypotheses. If these conditions are met in practice, then DIF findings, if replicated, may 
suggest changes in educational assessment and practice. Evaluation of DIF hypotheses is 
complicated, however, by a variety of practical and ethical considerations. Sound scientific 
method needs to operate within these constraints and achieve success in advancing knowledge 
that will affect test development practice, assessment, and educational practice. 

Figure 2: FITTED PROBABILITIES FOR THE "PALLID/ASHEN" ITEM PAIR 
REFERENCE GROUP=WHITE, FOCAL GROUP=HISPANIC 

Figure 3: PLOT OF T(X) VS X 

Figure 4: Differential Speededness On Form A 

Figure 5: Differential Speededness On Form D 